PLSC30500, Fall 2024

Part 4. Regression (part b)

Andy Eggers

The math behind OLS

Motivation

OLS estimator: \[(\hat{\beta_0}, \hat{\beta_1}, \ldots, \hat{\beta_K}) = \underset{(b_0, b_1, \ldots, b_K) \in \mathbb{R}^{K+1}}{\arg\min} \, \frac{1}{n}\sum_{i =1}^n\, \left(Y - (b_0 + b_1X_{[1]} + \ldots + b_K X_{[K]})\right)^2\]

Could solve by trying every combination of \(b_0, b_1, \ldots, b_K\).



Let’s try to understand a more efficient solution: \[\hat{\boldsymbol{\beta}} = \left( \mathbb{X}^T \mathbb{X} \right)^{-1} \mathbb{X}^T \mathbf{Y}\]

The case of one predictor

\[(\hat{\beta_0}, \hat{\beta_1}) = \underset{(b_0, b_1) \in \mathbb{R}^{2}}{\arg\min} \, \frac{1}{n} \sum_{i =1}^n\, \left(Y_i - (b_0 + b_1X_i)\right)^2\]

Expanding the sum, \(S\), we get

\[\begin{align} S &= \sum_{i = 1}^n \left(Y_i^2 - 2\beta_0 Y_i - 2\beta_1 X_iY_i + \beta_0^2 + 2\beta_0 \beta_1 X_i + \beta_1^2 X_i^2\right) \\ &= \sum Y_i^2 - 2 \beta_0 \sum Y_i - 2\beta_1 \sum X_i Y_i + n \beta_0^2 + 2\beta_0 \beta_1 \sum X_i + \beta_1^2 \sum X_i^2 \end{align}\]


We want to choose \(\beta_0\) and \(\beta_1\) to minimize \(S\).

Review: optimization via differentiation

How would you choose \(\beta_0\) to minimize \(S = -2\beta_0 + \beta_0^2\)?

Observing that the function is flat only at the minimum, we use differentiation to obtain the first order condition:

\[\frac{\partial S}{\partial \beta_0} = -2 + 2 \beta_0 = 0\]

So the solution is \(\beta_0 = 1\).

Optimization in two dimensions

How would you choose \(\beta_0\) and \(\beta_1\) to minimize \(S = -2\beta_0 + \beta_0^2 - 3 \beta_1 + \beta_1^2 + \beta_0 \beta_1\)?

Taking same approach, first order conditions are

\[\begin{align} \frac{\partial S}{\partial \beta_0} &= -2 + 2 \beta_0 + \beta_1 = 0 \\ \frac{\partial S}{\partial \beta_1} &= -3 + \beta_0 + 2\beta_1 = 0 \end{align}\]

A system of equations!

\[\begin{align} 2 \beta_0 + \beta_1 = 2 \\ \beta_0 + 2\beta_1 = 3 \end{align}\]

By isolating and substituting, we get \(\beta_0 = 1/3\) and \(\beta_1 = 4/3\).

Back to our problem

To find \(\beta_0\) and \(\beta_1\) that minimize \[S = \sum Y_i^2 - 2 \beta_0 \sum Y_i - 2\beta_1 \sum X_i Y_i + n \beta_0^2 + 2\beta_0 \beta_1 \sum X_i + \beta_1^2 \sum X_i^2,\] we use the same approach.

\[\begin{align} \frac{\partial S}{\partial \beta_0} &= -2 \sum Y_i + 2 n \beta_0 + 2\beta_1 \sum X_i &= 0\\ \frac{\partial S}{\partial \beta_1} &= -2 \sum X_i Y_i + 2 \beta_0 \sum X_i + 2 \beta_1 \sum X_i^2 &= 0 \end{align}\]

System of equations:

\[\begin{align} n \beta_0 + \beta_1 \sum X_i &= \sum Y_i \\ \beta_0 \sum X_i + \beta_1 \sum X_i^2 &= \sum X_i Y_i \end{align}\]

Solving our system of equations

\[\begin{align} n \beta_0 + \beta_1 \sum X_i &= \sum Y_i \\ \beta_0 \sum X_i + \beta_1 \sum X_i^2 &= \sum X_i Y_i \end{align}\]

Isolating \(\beta_0\) in the first equation, we get \[ \beta_0 = \frac{\sum Y_i}{n} - \beta_1 \frac{\sum X_i}{n}\]

Substituting into second equation, rearranging, dividing top & bottom by \(n\), we get \[\beta_1 = \frac{\frac{\sum X_iY_i}{n} - \frac{\sum X_i}{n}\frac{\sum Y_i}{n}}{\frac{\sum X_i^2}{n} - \left(\frac{\sum X_i}{n}\right)^2}\]

These are the sample analogues of the bivariate BLP (where \(\beta_1 = \frac{\text{Cov}[X,Y]}{{\textrm V}\,[X]}\)).

Where are we?

We have shown how, with a bit of calculus, we can derive the OLS estimator in the case with one predictor.

If we want \(K > 1\) predictors, could do the same thing:

  • expand the sum \(S\)
  • obtain \(K + 1\) first order conditions
  • solve by isolating and substituting

A horrible mess.

To make things easier, we turn to some linear algebra.

Matrix basics

A matrix

A matrix \(\mathbb{B}\) is a rectangular table of numbers, e.g. 

\[ \mathbb{B} = \begin{bmatrix} 1 & 2 \\ 4 & 0 \\ -1 & 3 \end{bmatrix}\]

This is a \(3\times 2\) matrix (rows \(\times\) columns).


The transpose of \(\mathbb{B}\), written \(\mathbb{B}^T\) or \(\mathbb{B}'\), is

\[ \mathbb{B}^T = \begin{bmatrix} 1 & 4 & -1\\ 2 & 0 & 3 \end{bmatrix}.\]

\(k\)th row of \(\mathbb{B}\) is \(k\)th column of \(\mathbb{B}^T\) (and vice versa).

Matrix multiplication

  • A column vector of length \(K\) is a matrix with one column and \(K\) rows.
  • A row vector of length \(K\) is a matrix with one row and \(K\) columns.

When we multiply a matrix \(\mathbb{B}\) with \(K\) columns by a column vector of length \(K\), \(\mathbf{D}\), we get a column vector of length \(K\):

Matrix multiplication more generally

Matrix manipulation in R

# equivalent ways to make B:
B <- cbind(c(1, 4, -1), c(2, 0, 3))
(B <- matrix(data = c(1, 4, -1, 2, 0, 3), ncol = 2))
     [,1] [,2]
[1,]    1    2
[2,]    4    0
[3,]   -1    3
t(B)
     [,1] [,2] [,3]
[1,]    1    4   -1
[2,]    2    0    3
# also equivalent:
D <- c(2, 1)
(D <- matrix(data = c(2, 1), ncol = 1))
     [,1]
[1,]    2
[2,]    1
B %*% D
     [,1]
[1,]    4
[2,]    8
[3,]    1

Can only multiply \(\mathbb{B}\) and \(\mathbf{D}\) if number of columns in \(\mathbb{B}\) equals number of rows in \(\mathbf{D}\).

Our OLS problem in matrix algebra

Recall our system of equations:

\[\begin{align} n \beta_0 + \beta_1 \sum X_i &= \sum Y_i \\ \beta_0 \sum X_i + \beta_1 \sum X_i^2 &= \sum X_i Y_i \end{align}\]

We can write this in matrix form as

\[ \mathbb{B} \boldsymbol{\beta} = \mathbf{D} \]

where

\[\mathbb{B} = \begin{bmatrix} n & \sum X_i \\ \sum X_i & \sum X_i^2 \end{bmatrix}, \, \, \, \,\, \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}, \, \, \, \,\, \mathbf{D} = \begin{bmatrix} \sum Y_i \\ \sum X_i Y_i \end{bmatrix}\]

The inverse: concept

The inverse of a scalar (number) \(a\) is written \(a^{-1}\), and \(a \times a^{-1} = 1\). (e.g. \(3^{-1} = 1/3\), so \(3 \times 1/3 = 1\).)

We multiply both sides by the inverse of \(a\) when we do

\[\begin{align} a x &= b \\ x &= b/a \end{align}\]


The inverse of a matrix \(\mathbb{B}\) is written \(\mathbb{B}^{-1}\).

We can similarly solve our problem using the inverse of \(\mathbb{B}\):

\[\begin{align} \mathbb{B} \boldsymbol{\beta} &= \mathbf{D} \\ \boldsymbol{\beta} &= \mathbb{B}^{-1} \mathbf{D} \end{align}\]

This is a useful way to represent the solution to a system of equations.

The inverse: implementation

There are manual procedures for inverting matrices. We use R’s solve() function:

# creating our matrices for one toy dataset
x <- c(1, 2, 3)
y <- c(1, 3, 3)
(B <- cbind(c(length(x), sum(x)), c(sum(x), sum(x^2))))
     [,1] [,2]
[1,]    3    6
[2,]    6   14
(D <- c(sum(y), sum(x*y)))
[1]  7 16
solve(B)%*%D # using inverse to solve the system of eqns
          [,1]
[1,] 0.3333333
[2,] 1.0000000
# checking the solution
lm(y ~ x) |> coef() 
(Intercept)           x 
  0.3333333   1.0000000 

So we have (1) a way to represent systems of equations and (2) a way to solve them.

The general OLS estimator in matrix form

OLS with \(K\) predictors

Let \(\mathbb{X}\) be the \(n \times K + 1\) regressor matrix (a column of 1s and 1 column per predictor):

\[\mathbb{X} = \begin{bmatrix} 1 & X_{[1]1} & X_{[2]1} & \cdots & X_{[K]1} \\ 1 & X_{[1]2} & X_{[2]2} & \cdots& X_{[K]2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_{[1]n} & X_{[2]n} & \cdots & X_{[K]n} \end{bmatrix}\]

Let \(\boldsymbol{\beta}\) be the \(K+1\)-length column vector of coefficients: \[\boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{K} \end{bmatrix}\]

Let \(\mathbf{Y}\) be the \(n\)-length column vector of outcomes: \[\mathbf{Y} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_{n} \end{bmatrix}\]

OLS with \(K\) predictors (2)

For a given coefficient vector \(\boldsymbol{\beta}\),

  • \(\mathbb{X}\boldsymbol{\beta}\) is a column vector of predictions
  • \(\mathbf{Y} - \mathbb{X}\boldsymbol{\beta}\) is a column vector of residuals
  • \((\mathbf{Y} - \mathbb{X}\boldsymbol{\beta})^T (\mathbf{Y} - \mathbb{X}\boldsymbol{\beta})\) is the sum of squared residuals

OLS estimator in matrix form:

\[(\hat{\beta_0}, \hat{\beta_1}, \ldots, \hat{\beta_K}) = \underset{(b_0, b_1, \ldots, b_K) \in \mathbb{R}^{K+1}}{\arg\min} \, (\mathbf{Y} - \mathbb{X}\boldsymbol{\beta})^T (\mathbf{Y} - \mathbb{X}\boldsymbol{\beta}) \]

Now we do minimization via differentiation for this version of the problem.

Minimization via differentiation (again)

Expand the sum:

\[ (\mathbf{Y} - \mathbb{X}\boldsymbol{\beta})^T (\mathbf{Y} - \mathbb{X}\boldsymbol{\beta}) = \mathbf{Y}^T \mathbf{Y} - \mathbf{Y}^T \mathbb{X} \boldsymbol{\beta} - \boldsymbol{\beta}^T \mathbb{X}^T \mathbf{Y} + \boldsymbol{\beta}^T\mathbb{X}^T \mathbb{X} \boldsymbol{\beta}\]

Differentiating with respect to \(\boldsymbol{\beta}\) gives us the first order condition(s), \[2 \mathbb{X}^T \mathbb{X}\boldsymbol{\beta} - 2 \mathbb{X}^T \mathbf{Y} = 0,\]

which becomes the system of equations \[\mathbb{X}^T \mathbb{X}\boldsymbol{\beta} = \mathbb{X}^T \mathbf{Y} \]

with solution \[ \boldsymbol{\beta} = \left(\mathbb{X}^T \mathbb{X}\right)^{-1} \mathbb{X}^T \mathbf{Y}.\]

Back to our single-predictor case

Our system of equations was:

\[\begin{align} n \beta_0 + \beta_1 \sum X_i &= \sum Y_i \\ \beta_0 \sum X_i + \beta_1 \sum X_i^2 &= \sum X_i Y_i \end{align}\]

In matrix form:

\[\begin{bmatrix} n & \sum X_i \\ \sum X_i & \sum X_i^2 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} = \begin{bmatrix} \sum Y_i \\ \sum X_i Y_i \end{bmatrix}\]

This is \(\mathbb{X}^T \mathbb{X} \boldsymbol{\beta} = \mathbb{X}^T \mathbb{Y}\):

\[\begin{align}\mathbb{X}^T \mathbb{X} &= \begin{bmatrix} 1 & 1 & \cdots & 1 \\ X_1 & X_2 & \cdots & X_n \end{bmatrix} \begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_3 \end{bmatrix}\\ &= \begin{bmatrix} n & \sum X_i \\ \sum X_i & \sum X_i^2\end{bmatrix} \end{align}\]

\[\begin{align}\mathbb{X}^T \mathbf{Y} &= \begin{bmatrix} 1 & 1 & \cdots & 1 \\ X_1 & X_2 & \cdots & X_n \end{bmatrix} \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} \\ &= \begin{bmatrix} \sum Y_i \\ \sum X_i Y_i \end{bmatrix} \end{align}\]

Illustration with CCES data

dat <- read.csv("./../data/cces_2012_subset.csv")
use <- !is.na(dat$env) & !is.na(dat$gender) & !is.na(dat$aa)
X <- cbind(1, dat$gender[use], dat$aa[use])
head(X, 3)
     [,1] [,2] [,3]
[1,]    1    1    4
[2,]    1    2    1
[3,]    1    2    4
Y <- dat$env[use]
head(Y, 3)
[1] 4 1 5
solve(t(X) %*% X) %*% t(X) %*% Y
           [,1]
[1,] 1.14259962
[2,] 0.07957051
[3,] 0.65731827
lm(env ~ gender + aa, data = dat)

Call:
lm(formula = env ~ gender + aa, data = dat)

Coefficients:
(Intercept)       gender           aa  
    1.14260      0.07957      0.65732  

Summing up

Most important to know about OLS:

  • OLS (minimizing sum/mean of squared residuals in sample) is an estimator of the BLP
  • you decide which BLP you want to approximate

Today we have added some intuition about how OLS is estimated:

  • we minimize the sum/mean of squared residuals using differentiation
  • the first order conditions constitute a system of equations, one for each coefficient
  • we can solve that system by isolating and substituting
  • or we can use matrix algebra: \(\boldsymbol{\beta} = (\mathbb{X}^T\mathbb{X})^{-1} \mathbb{X}^T \mathbf{Y}\)

More resources

For walk-through of the optimization part of matrix OLS, see Ben Lambert’s econometrics videos:

Also see video on variance of OLS estimator given homoskedasticity